seq: use Floyd's combination algorithm to sample indices #479
Conversation
CC @vitiral who wrote the old implementation.
I realise that this is …

The partial Fisher-Yates shuffle avoids the need for comparison and should thus be faster when the time/memory needed to allocate the whole sample space is not significant; essentially this algorithm is …

Does this mean that for efficiency we should have three backing implementations? Not ideal.

It may be worth pointing out that there is a shuffling variant of Floyd's algorithm (if …).
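For reference, Floyd's combination algorithm is short enough to sketch here. This is a minimal standalone sketch, not the PR's code: the tiny `Lcg` type stands in for a real RNG (it is not from the rand crate), and the modulo draw has a small bias that a real implementation would avoid.

```rust
// Minimal LCG as a stand-in RNG (illustration only, not from the rand crate).
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        // Knuth's MMIX multiplier/increment.
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }

    // Value in 0..bound; modulo bias is acceptable for a sketch.
    fn below(&mut self, bound: u32) -> u32 {
        self.next_u32() % bound
    }
}

/// Floyd's algorithm: sample `amount` distinct indices from `0..length`
/// using exactly `amount` random draws and `O(amount)` memory.
fn sample_floyd(rng: &mut Lcg, length: u32, amount: u32) -> Vec<u32> {
    assert!(amount <= length);
    let mut indices: Vec<u32> = Vec::with_capacity(amount as usize);
    for j in (length - amount)..length {
        let t = rng.below(j + 1);
        // If t was already picked, j cannot have been (j is new this round),
        // so picking j instead keeps every subset equally likely.
        if indices.contains(&t) {
            indices.push(j);
        } else {
            indices.push(t);
        }
    }
    indices
}

fn main() {
    let mut rng = Lcg(0xdeadbeef);
    let sample = sample_floyd(&mut rng, 1000, 10);
    assert_eq!(sample.len(), 10);
    println!("{:?}", sample);
}
```

Note that the `contains` scan makes this quadratic in `amount`, which is why large `amount` is a concern in the discussion below.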
@dhardy thanks for including me, but unfortunately I won't be able to comment for a while because of other commitments.
Some more benchmarks:
Observations:
I'll do some more testing to try to find the best algorithm (on my system, at least).
Note: the above benchmarks are with …
These were the best models of execution time I was able to make. I have a Gnumeric spreadsheet of the models. Text:
Unfortunately, the performance of the "inplace" algorithm is hard to model, and …
Of course, this is still complicated. Possibly we can multiply everything by some large-ish value (avoiding overflow) then use integer approximations — at least where we know n, m are small.

A further thing of note is that the …

Another thing to note: Floyd's alg is the only practical option if there is no allocator, and should always use the least memory. For extremely large m, Floyd's alg is unusably slow (remember, it's …).

Thoughts? It does seem silly doing a huge amount of work to find the fastest algorithm.
I reduced the logic slightly (edit: this version has …):

```rust
if amount < 517 {
    const C: [[usize; 2]; 2] = [[1, 36], [200, 440]];
    let j = (length < 500_000) as usize;
    let m4 = 4 * amount;
    if C[0][j] * length < (C[1][j] + m4) * amount {
        sample_indices_inplace(rng, length as u32, amount as u32)
            .into_iter()
            .map(|x| x as usize)
            .collect()
    } else {
        sample_indices_floyd(rng, length, amount as u32)
    }
} else {
    const C: [[usize; 2]; 2] = [[1, 36], [62*40, 68*40]];
    let j = (length < 500_000) as usize;
    if C[0][j] * length < C[1][j] * amount {
        sample_indices_inplace(rng, length as u32, amount as u32)
            .into_iter()
            .map(|x| x as usize)
            .collect()
    } else {
        sample_indices_cache(rng, length, amount)
    }
}
```

This now comes with a warning, though it's barely necessary:
Although we have to convert the arrays to …

I think it may also be worth exposing the various implementations directly. I will also investigate making the Floyd version usable without an allocator.
PR updated. Benchmark summary:
Who'd have thought sampling a unique subset of elements would be so complex! I think it's done now.
Interesting. The tests fail on x86 because …

This brings up a small dilemma — do users ever actually want to sample indices in excess of 2^32-1? Regarding actual indices, I doubt it (this would imply a byte-array in excess of 4 GiB or a 32-bit word array in excess of 16 GiB). For other possible uses, I don't know (though the functions aren't exactly optimal for all uses anyway — e.g. for sampling from a range starting above 0).

Possible tweaks:
Yes …
I chose to support …
I noticed that we lost one feature: the output is now not fully randomised. Floyd's algorithm has a variant with full randomisation (without requiring extra random draws), but it requires insertions at arbitrary positions. Since this is only used with lengths under 517 anyway, the result isn't too bad, but it does still reduce performance:
Obviously this implies the selection heuristics need adjustment, but is it worth adding up to 77% overhead for a feature not everyone wants? Probably not. On the other hand, some people will want it, and in many cases the overhead is small or even zero. It is possible to add a …
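The shuffled variant mentioned above (the sequence-generating form of Floyd's algorithm from Bentley & Floyd's "A Sample of Brilliance") can be sketched as follows. As before, this is a standalone illustration, not the PR's code: `Lcg` is a biased stand-in RNG and the names are illustrative.

```rust
// Minimal LCG as a stand-in RNG (illustration only, not from the rand crate).
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }

    fn below(&mut self, bound: u32) -> u32 {
        self.next_u32() % bound // modulo bias: fine for a sketch
    }
}

/// Floyd's algorithm with full randomisation of order: the same number of
/// RNG draws as the plain variant, but results are inserted at arbitrary
/// positions, which is what makes it slower on a Vec.
fn sample_floyd_shuffled(rng: &mut Lcg, length: u32, amount: u32) -> Vec<u32> {
    assert!(amount <= length);
    let mut indices: Vec<u32> = Vec::with_capacity(amount as usize);
    for j in (length - amount)..length {
        let t = rng.below(j + 1);
        match indices.iter().position(|&x| x == t) {
            // t not yet sampled: prepend it.
            None => indices.insert(0, t),
            // t already sampled: put j immediately after t; this choice is
            // what yields a uniformly random order of the result.
            Some(p) => indices.insert(p + 1, j),
        }
    }
    indices
}

fn main() {
    let mut rng = Lcg(42);
    println!("{:?}", sample_floyd_shuffled(&mut rng, 100, 8));
}
```

The arbitrary-position `insert` calls are the source of the extra overhead discussed above.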
Updated with the mentioned …

New benchmarks:
Model parameters are now:
Regarding shuffled selection, I think removing it as a feature is fine.
Make sure the docs are fixed.
Cool stuff! Sorry I'm not more involved, but very busy atm.
Force-pushed from 5c6be22 to 299307d.
Rebased for 0.6 master branch. I think this is ready to be merged now, if the API changes are acceptable? I know it's very clunky having an extra parameter on the methods and two implementations of Floyd's algorithm, but considering the shuffling is sometimes useful and quite a lot cheaper than shuffling later, this seems a good choice?
The heuristics seem very machine-dependent (or even compiler-flag-dependent). Any way to test them on different setups? Maybe it also makes sense to expose the different algorithms so users can pick their preferred one.
Yes, the heuristics are very tailored to one machine. In theory the work could be extended to other platforms, but that's beyond the effort I want to go to. (I also didn't do anything to automate the model production or even to read values from the benchmarks into the spreadsheet used, so it's a bit of work for each platform. Shouldn't be that difficult to replicate, though, using my spreadsheet and benchmarks.)

Exposing the individual algorithms increases the API surface and makes changes harder. It doesn't seem very worth doing.
I would like to have a good look, as it also is a bit subtle. I hope to get to it tomorrow. Sorry for the delay...
Impressive set of benchmarks! And the spreadsheet helps 😄.
I think making the choice for shuffled results looks a bit clunky, but is a good idea overall.
I don't like the change to use only `u32`; we can do better with a little effort. While it is a quite reasonable limitation, it also seems like just the thing to work in testing and fail in production.
src/seq.rs (Outdated)

```rust
if C[0][j] * (length as u64) < (C[1][j] + m4) * amount as u64 {
    sample_indices_inplace(rng, length, amount)
} else if shuffled {
    sample_indices_floyd_shuffled(rng, length, amount)
```
I like the permutation and shuffling algorithm. It really uses the minimum number of values from the RNG possible. But in this case using `shuffle` is just faster:

```rust
let mut indices = sample_indices_floyd(rng, length, amount);
rng.shuffle(&mut indices);
indices
```
Hmm, I guess it could be. I should have benchmarked that!
src/seq.rs (Outdated)

```rust
sample_indices_inplace(rng, length, amount)
if length > (::core::u32::MAX as usize) {
    panic!("`length` is not representable as `u32`");
}
```
I am really not a fan of this change. All algorithms can work with similar performance as when using a `usize`, but it would take a little extra complexity. It would simply make things work without special cases.

I gave it a try for Floyd's algorithm here. And played with optimizing a partial shuffle in https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs#L107 (used in `sample_indices_inplace`). I am sure we can come up with something cleaner than that which also works with `sample_indices_cache`.
That will still slow down Floyd's algorithm with small indices since the output vector is larger, hence searching it (`contains`) is slower when more memory needs to be read. In particular, more results can fit in the cache with smaller indices. This means that it can be faster to convert the output back to `usize` later, although that's ugly. Only outputting `Vec<u32>` was motivated by this problem.

But, yes, I get your point that we shouldn't fail arbitrarily here. I would have accepted `u32` inputs only, but thought users may just do `x as u32`, risking invalid conversion.
So …

- We can just use …
- We could do what I initially implemented: …
- We could try making …
- We could try to make …

Any thoughts on this? I guess the iterator method may deserve more investigation (I'm not sure how well it would optimise). The generic method sounds both complex to implement and complex to use. The next best option might be only using …
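One way to sidestep the `u32`-vs-`usize` choice is a small enum over the two vector types, similar in spirit to the `IndexVec` type discussed later in this thread. The sketch below is illustrative only; the actual rand API differs in its methods and trait impls.

```rust
/// Stores indices as u32 where the length permits, widening to usize at the
/// boundary. Illustrative sketch; rand's actual IndexVec differs in detail.
enum IndexVec {
    U32(Vec<u32>),
    USize(Vec<usize>),
}

impl IndexVec {
    fn len(&self) -> usize {
        match self {
            IndexVec::U32(v) => v.len(),
            IndexVec::USize(v) => v.len(),
        }
    }

    /// The index at position i, widened to usize.
    fn index(&self, i: usize) -> usize {
        match self {
            IndexVec::U32(v) => v[i] as usize,
            IndexVec::USize(v) => v[i],
        }
    }

    /// Convert into Vec<usize>, reallocating only in the u32 case.
    fn into_vec(self) -> Vec<usize> {
        match self {
            IndexVec::U32(v) => v.into_iter().map(|x| x as usize).collect(),
            IndexVec::USize(v) => v,
        }
    }
}

fn main() {
    let small = IndexVec::U32(vec![3, 1, 2]);
    assert_eq!(small.len(), 3);
    assert_eq!(small.index(0), 3);
    assert_eq!(small.into_vec(), vec![3, 1, 2]);
}
```

This keeps the cache-friendly `u32` storage internally while callers still see `usize` indices.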
The WASM build failed with the following panic:
Don't worry about that debug info error: koute/cargo-web#116

Please do, @pitdicker!
Should we document somewhere that we need at least cargo-web 0.6.14 for WASM support? |
You did great work here.
I feel a bit uneasy about `IndexVec`, but I suppose it is a useful abstraction. Can you improve the doc comment of it to explain why it exists?

And I think it is better if `sample_floyd` does nothing more than the named algorithm, with the shuffling choices abstracted out of it.
CHANGELOG.md (Outdated)

```markdown
### Sequences module
- Optimised and changed return type of the `sample_indices` function. (#479)
- Added `shuffled: bool` input parameter to `sample_indices`. (#479)
```
No longer true?
src/seq/index.rs (Outdated)

```rust
/// The output values are fully shuffled. (Overhead is under 50%.)
///
/// This implementation uses `O(amount)` memory and `O(amount^2)` time.
pub fn sample_floyd<R>(rng: &mut R, length: u32, amount: u32) -> IndexVec
```
Could you make this function more 'pure'? I.e. add back the shuffled bool, and leave out the later shuffle? Then this function works as the name says, and another function can be convenient...
But why would anyone else use this directly? The variable shuffling approach is important for performance. It's not supposed to be `pub`...
That seems to be more of a problem between nightlies and cargo-web. Not really something caused by rand, so I see no reason to document it.
I was hoping to do some actual analysis of the performance of the various options. However it doesn't appear that I'll have time to do that anytime soon. I did get to the point where I have graphs showing performance of various algorithms.

Graph for returning a raw …

In both cases the result is iterated once to make the comparison more fair, i.e. it accounts for the overhead involved in the more complex iterator required in the latter approach. This doesn't explore gains from using …

There's one graph for each length sampled from. On the x-axis is the number of items sampled; on the y-axis is the time (divided by the number of items sampled to make results more comparable).

In both cases I made three categories of algorithms:
Code for generating the data, as well as the raw data, is in https://github.com/sicking/rand_choose_multiple_bench
Wow, that's a lot of work you've done on this @sicking; it goes quite a way beyond what I did. This problem seems to have grown from a "small" optimisation into something quite complex.

Result type: my …

Sorting of elements:
Which brings up another point: the skipping method does well in some tests with large input and output. But if the output must be shuffled (as my latest version requires), this will destroy the performance (not tested, but shuffling is at least …).

There may be some applications which require stable ordering of samples; perhaps this is best met with a separate …

There are also applications which don't care about ordering at all. I previously argued that we should just give them fully-shuffled samples since it simplifies the API and avoids a few potential bugs; this argument makes sense if the overhead is generally low, but may not if the overhead can be large.

Regarding algorithms: interesting that the main benefit of Floyd is its use of …
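For context, a sampler that produces inherently sorted output is easy to sketch with selection sampling (Knuth's Algorithm S), which is the general family a one-pass "skipping" approach belongs to. This is not the PR's code; `Lcg` is a biased stand-in RNG used only to keep the sketch self-contained.

```rust
// Minimal LCG as a stand-in RNG (illustration only, not from the rand crate).
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }

    fn below(&mut self, bound: u32) -> u32 {
        self.next_u32() % bound // modulo bias: fine for a sketch
    }
}

/// Selection sampling (Knuth's Algorithm S): walk 0..length once, keeping
/// each index with probability (still needed)/(still remaining). Output is
/// inherently sorted; O(length) time, O(amount) memory.
fn sample_selection(rng: &mut Lcg, length: u32, amount: u32) -> Vec<u32> {
    assert!(amount <= length);
    let mut indices = Vec::with_capacity(amount as usize);
    let mut needed = amount;
    for i in 0..length {
        if needed == 0 {
            break;
        }
        // Keep i with probability needed / (length - i); when the counts
        // become equal this is probability 1, so exactly `amount` are kept.
        if rng.below(length - i) < needed {
            indices.push(i);
            needed -= 1;
        }
    }
    indices
}

fn main() {
    let mut rng = Lcg(9);
    println!("{:?}", sample_selection(&mut rng, 30, 10));
}
```

The O(length) walk is why this family only pays off when the amount sampled is a large fraction of the length.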
I don't quite understand what you're saying here. But none of the approaches I measured use truly lazy generation during iteration. In all approaches all the samples are generated before the sampling function returns. The only difference is whether a …

I never got to comparing time across those two approaches. But the timings are intentionally generated such that it's meaningful to do such a comparison.
Yeah. I intentionally didn't attempt to measure algorithms that first generated results and then shuffled or randomized them. It looked to me that the timings were close enough between the various algorithms that that would always be slower than using an algorithm that directly produces sorted/shuffled results. But that was an untested assumption.
There's a significant difference though in that …

But I've already voiced my opinion on that aspect so I won't rehash again.
The sample_indices_inplace algorithm is inappropriate for large numbers
(controversial)
Update with new benchmark data from `u32` impls of Floyd's and cached algorithms (inplace alg already used benchmarks from `u32` impl). Update Floyd's with a balanced model adequate for both shuffled and unshuffled versions.
Motivation: don't have to worry about overflow whichever index type is used. Appears to slightly improve some benchmarks, no effect on others.
This is to allow use of u32 or usize internally
This accounts for the "cache" method being replaced by rejection sampling and now using usize again.
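The rejection-sampling approach that replaced the "cache" method can be sketched like this. It is a standalone sketch with a stand-in `Lcg` RNG; the real implementation in rand differs in details such as the set type and RNG handling, and the `length <= u32::MAX` limit here is purely a sketch simplification.

```rust
use std::collections::HashSet;

// Minimal LCG as a stand-in RNG (illustration only, not from the rand crate).
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }

    fn below(&mut self, bound: u32) -> u32 {
        self.next_u32() % bound // modulo bias: fine for a sketch
    }
}

/// Rejection sampling: draw indices uniformly from 0..length, using a set
/// to reject duplicates. Fast while amount is small relative to length.
fn sample_rejection(rng: &mut Lcg, length: usize, amount: usize) -> Vec<usize> {
    assert!(amount <= length);
    assert!(length <= u32::MAX as usize); // sketch limitation
    let mut seen: HashSet<usize> = HashSet::with_capacity(amount);
    let mut indices = Vec::with_capacity(amount);
    while indices.len() < amount {
        let x = rng.below(length as u32) as usize;
        // insert() returns false for duplicates, which we simply redraw.
        if seen.insert(x) {
            indices.push(x);
        }
    }
    indices
}

fn main() {
    let mut rng = Lcg(7);
    let s = sample_rejection(&mut rng, 1_000_000, 5);
    assert_eq!(s.len(), 5);
    println!("{:?}", s);
}
```

Unlike Floyd's `contains` scan over a `Vec`, the `HashSet` keeps duplicate checks near constant time, at the cost of an extra allocation.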
Only that considering the result primarily an iterator and secondarily a vector is inappropriate, since the current API is not really compatible with generating values on demand. That's not to say we couldn't do this, but it would be fundamentally quite different.

I don't consider this perfect, but it's still a lot better than the current approach, and perfect is the enemy of good as the saying goes. I'm going to merge once CI passes; there is still time for tweaking before 0.6 is released if need be.
Adapt #144 for sampling indices
This isn't the fully generic implementation that @burdges wrote; we could make this more generic. I'm guessing we could include an arbitrary lower bound without performance overhead (since the compiler can probably eliminate a constant of 0), and could also use generic typing.
I'm not really sure if it should be committed like this. It cleans up the code and improves performance in some cases, but isn't optimal in others. We could keep the old `sample_indices_inplace` and just replace `sample_indices_cache`.

Edit: now: